“It’s Friday night!”
Customer behaviour, especially in the food and beverage industry, is strongly tied to seasonal patterns. The owner wants to analyze the number of visitors (including dine-in, delivery, and takeaway transactions) so he can make better decisions in 2018. Fortunately, time series analysis can provide both a good forecast and an explanation of the seasonality.
Library Usage:
library(dplyr)
library(lubridate)
library(tidyr)
library(ggplot2)
library(TSstudio)
library(forecast)
library(readr)
library(padr)
library(zoo)
library(ggsci)
library(plotly)
Read the source files: training data (data-train.csv) and test data (data-test.csv).
fnb <- read_csv("data-train.csv")
## Rows: 137748 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): receipt_number, item_id, item_group, item_major_group, payment_typ...
## dbl (3): quantity, price_usd, total_usd
## dttm (1): transaction_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Preview the training data; it will be used for training and validation.
head(fnb, 3)
## # A tibble: 3 × 10
## transaction_date receipt_…¹ item_id item_…² item_…³ quant…⁴ price…⁵ total…⁶
## <dttm> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2017-12-01 13:32:46 A0026694 I10100… noodle… food 1 7.33 7.33
## 2 2017-12-01 13:32:46 A0026694 I10500… drinks bevera… 1 4.12 4.12
## 3 2017-12-01 13:33:39 A0026695 I10500… drinks bevera… 1 2.02 2.02
## # … with 2 more variables: payment_type <chr>, sales_type <chr>, and
## # abbreviated variable names ¹receipt_number, ²item_group, ³item_major_group,
## # ⁴quantity, ⁵price_usd, ⁶total_usd
The dataset includes the following columns:
- transaction_date: the timestamp of a transaction
- receipt_number: the ID of a transaction
- item_id: the ID of an item in a transaction
- item_group: the group ID of an item in a transaction
- item_major_group: the major-group ID of an item in a transaction
- quantity: the quantity of the purchased item
- price_usd: the unit price of the purchased item
- total_usd: the total price of the purchased item
- payment_type: the payment method
- sales_type: the sales method
Below are the preprocessing steps needed before further analysis. First, round each timestamp down to the hour using floor_date() and save the result in a ds column:
fnb_clean <- fnb %>%
mutate(transaction_date=ymd_hms(transaction_date),
ds = floor_date(transaction_date, unit = "hours")) %>%
select(transaction_date, receipt_number, ds)

Next, group by ds (the hourly timestamp) and count distinct receipt_number values to get the number of visitors per hour:
fnb_clean <- fnb_clean %>% group_by(ds) %>% summarise(visitor = n_distinct(receipt_number))

Apply time series padding to fill in missing hours, so that each day consists of 24 records (one per hour):
fnb_clean <- fnb_clean %>% pad()
## pad applied on the interval: hour
Replace the NA values introduced by padding with 0:
fnb_clean <- fnb_clean %>% mutate(visitor = na.fill(visitor, 0))
fnb_clean
## # A tibble: 1,907 × 2
## ds visitor
## <dttm> <int>
## 1 2017-12-01 13:00:00 16
## 2 2017-12-01 14:00:00 38
## 3 2017-12-01 15:00:00 27
## 4 2017-12-01 16:00:00 29
## 5 2017-12-01 17:00:00 44
## 6 2017-12-01 18:00:00 50
## 7 2017-12-01 19:00:00 66
## 8 2017-12-01 20:00:00 70
## 9 2017-12-01 21:00:00 63
## 10 2017-12-01 22:00:00 63
## # … with 1,897 more rows
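The whole preprocessing chain (floor to the hour, count distinct receipts, pad, zero-fill) can be mimicked in base R on a few made-up transactions; this is a sketch of what lubridate::floor_date(), dplyr::n_distinct(), padr::pad(), and zoo::na.fill() do above, not the packages' implementations:

```r
# Toy transactions (hypothetical timestamps and receipt IDs)
tx <- data.frame(
  transaction_date = as.POSIXct(c("2017-12-01 13:32:46", "2017-12-01 13:33:39",
                                  "2017-12-01 13:59:59", "2017-12-01 16:05:10"),
                                tz = "UTC"),
  receipt_number = c("A001", "A001", "A002", "A003")
)
# 1. Floor each timestamp to the hour (what floor_date(unit = "hours") does)
tx$ds <- as.POSIXct(format(tx$transaction_date, "%Y-%m-%d %H:00:00"), tz = "UTC")
# 2. Count distinct receipts per hour (what n_distinct() per group does)
visitors <- aggregate(receipt_number ~ ds, data = tx,
                      FUN = function(r) length(unique(r)))
names(visitors)[2] <- "visitor"
# 3. Pad the missing hours (what padr::pad() does)
full <- data.frame(ds = seq(min(tx$ds), max(tx$ds), by = "hour"))
padded <- merge(full, visitors, all.x = TRUE)   # missing hours become NA
# 4. Zero-fill the padded hours (what na.fill(visitor, 0) does)
padded$visitor[is.na(padded$visitor)] <- 0
# Result: 13:00 -> 2 visitors, 14:00 -> 0, 15:00 -> 0, 16:00 -> 1
```

Note that receipts, not rows, are counted: the two items on receipt A001 contribute a single visitor.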
Check for remaining NA values; none are found:
# Check NA
colSums(is.na(fnb_clean))
## ds visitor
## 0 0
Filter ds to the outlet's opening hours, from 10:00 to 22:00:
fnb_clean <- fnb_clean %>% filter(hour(ds) >= 10 & hour(ds) <= 22)

Check the start and end of the time interval after filtering:
range(fnb_clean$ds)
## [1] "2017-12-01 13:00:00 UTC" "2018-02-18 22:00:00 UTC"
tail(fnb_clean, 10)
## # A tibble: 10 × 2
## ds visitor
## <dttm> <int>
## 1 2018-02-18 13:00:00 25
## 2 2018-02-18 14:00:00 25
## 3 2018-02-18 15:00:00 27
## 4 2018-02-18 16:00:00 34
## 5 2018-02-18 17:00:00 26
## 6 2018-02-18 18:00:00 31
## 7 2018-02-18 19:00:00 56
## 8 2018-02-18 20:00:00 67
## 9 2018-02-18 21:00:00 49
## 10 2018-02-18 22:00:00 41
Create a time series object with the ts() function using frequency 13, the number of hours the outlet is open each day (10:00 to 22:00).
fnb_ts <- ts(fnb_clean$visitor, frequency = 13)

Decompose and plot to inspect the seasonal, trend, and remainder components of the time series:
fnb_ts_dec <- fnb_ts %>% decompose(type = "additive")
fnb_ts %>% tail(13*7*4) %>% decompose() %>% autoplot()
Explanation:
In the plot above, the estimated trend component still shows a repeating pattern, which suggests uncaptured extra seasonality; the data can be considered multi-seasonal. To handle this, we convert the data into a multiple-seasonality time series object, which accepts more than one frequency.
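The leftover-seasonality argument can be demonstrated with a base-R toy series (entirely synthetic numbers): give it both a daily (period 13) and a weekly (period 91) pattern, decompose with frequency 13, and the weekly bump survives in the trend component instead of being absorbed by the seasonal one:

```r
# Synthetic series: 8 weeks of 13 hourly observations per day
daily  <- rep(10 * sin(2 * pi * (1:13) / 13), times = 7 * 8)      # period-13 pattern
weekly <- rep(rep(c(0, 0, 0, 0, 0, 5, 5), each = 13), times = 8)  # weekend bump, period 91
x <- ts(50 + daily + weekly, frequency = 13)

dec <- decompose(x)
# The daily sine is captured by the seasonal component, but the weekend
# bump remains in the trend: the trend still swings by about 5
trend_range <- diff(range(na.omit(dec$trend)))
```

Because decompose() estimates the trend with a 13-point moving average, any 13-point window averages the daily sine to zero, while the slower weekly pattern passes through almost untouched — exactly the "pattern in the trend" seen in the plot above.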
Create a multiple-seasonality time series object with the msts() function, using daily (13) and weekly (13*7) frequencies to capture both daily and weekly seasonality, then decompose and plot:
fnb_msts <- msts(data = fnb_clean$visitor,seasonal.periods = c(13,13*7))
fnb_msts_dec <- mstl(fnb_msts)
fnb_msts %>% tail(13*7*4) %>% stl(s.window = "periodic") %>% autoplot()
Explanation:
With the multiple-seasonality decomposition, the estimated trend is smoother and clearer, and the daily and weekly seasonal components are easier to distinguish and interpret in further analysis.
Create a visualization of the hourly seasonality from the standard time series object (ts()):
fnb_clean %>%
mutate(Hour = hour(ds), Seasonal = fnb_ts_dec$seasonal) %>%
distinct(Hour, Seasonal) %>%
ggplot() +
geom_bar(aes(x = Hour, y = Seasonal, fill = Seasonal), stat ="identity", width = 0.7)+
scale_fill_gradient(low = "black", high = "red") +
scale_x_continuous(breaks = seq(10,22,1)) +
labs(title = "Seasonality Analysis - Hourly")
Explanation:
The outlet's peak period is between 19:00 and 22:00, with the most visitors arriving at 20:00. The fewest visitors come at 10:00, right after the outlet opens.
Create a visualization of the weekly and hourly seasonality from the multi-seasonal time series object (msts()):
fnb_msts_dec_fr <- data.frame(fnb_msts_dec)
fnb_msts_dec_fr %>%
mutate(ds = fnb_clean$ds) %>%
mutate(Day_of_Week = wday(ds, label = TRUE, abbr = FALSE), Hour = (hour(ds))) %>%
group_by(Day_of_Week, Hour) %>%
summarise(Seasonal = sum(Seasonal13 + Seasonal91)) %>%
ggplot() +
geom_bar(aes(x = Hour, y = Seasonal, fill = Day_of_Week),stat ="identity", position = "stack", width = 0.7)+
scale_x_continuous(breaks = seq(10,22,1)) +
scale_fill_locuszoom()+
labs(title = "Multi-Seasonality Analysis - Weekly & Hourly")## `summarise()` has grouped output by 'Day_of_Week'. You can override using the
## `.groups` argument.
Explanation:
The outlet's peak period is between 19:00 and 22:00 on every day of the week, with the most visitors arriving on Saturday at 20:00. The fewest visitors come at 10:00, on every day of the week, right after the outlet opens.
We will split our training data into two sets named train and validation. The validation set consists of the last seven days of the restaurant's operating hours.
# Cross Validation
fnb_test_msts <- tail(fnb_clean, 13*7)
fnb_test_msts$visitor <- round(fnb_test_msts$visitor)
fnb_train_msts <- head(fnb_clean, nrow(fnb_clean) - 13*7)
fnb_train_msts$visitor <- round(fnb_train_msts$visitor)
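The head()/tail() split above holds out the last 13 * 7 = 91 hourly observations; a minimal base-R check (on a stand-in vector) that the two pieces partition the series without overlap:

```r
# Hold out the last 7 days (13 operating hours * 7 days = 91 observations)
n_hold <- 13 * 7
x <- 1:200                                  # stand-in for the visitor series
train_part <- head(x, length(x) - n_hold)   # everything except the last week
valid_part <- tail(x, n_hold)               # the last week only
# The split is contiguous: validation starts right after training ends
```

For time series, this ordered split (rather than random sampling) is essential: the validation window must come strictly after the training window.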
# Plot data Train & Validation
Plot <- fnb_train_msts %>%
ggplot(aes(x = ds, y = visitor)) +
geom_line() +
scale_x_datetime(name = "Transaction Date",
date_breaks = "2 weeks",
expand = expansion(mult = 0.05, add = 0.05)) +
scale_y_continuous(breaks = seq(0, 400, 50), expand = expansion(mult = 0.05, add = 0.05)) +
geom_line(data = fnb_test_msts,
aes(color = "red"),
show.legend = T)
ggplotly(Plot)

The graphic above shows the composition of our data: the train data runs from 2017-12-01 to 2018-02-11, and the validation data runs from 2018-02-12 to 2018-02-18.
After creating the MSTS object, we will fit it with several models:
- Multi-Seasonal ARIMA model
- Multi-Seasonal Holt-Winter
- TBATS Model
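Before fitting, it may help to see the idea behind stlm(): STL-decompose the series, model the seasonally adjusted part, then add the seasonal component back when forecasting. A base-R sketch of this idea on a synthetic series (a simplification, not the forecast package's actual implementation):

```r
set.seed(1)
# Synthetic series: 20 "days" of 13 hours with a daily pattern plus noise
x <- ts(50 + rep(10 * sin(2 * pi * (1:13) / 13), times = 20) + rnorm(260),
        frequency = 13)
# 1. STL decomposition with a periodic seasonal window
fit_stl <- stl(x, s.window = "periodic")
seas <- fit_stl$time.series[, "seasonal"]
# 2. Model the seasonally adjusted series (here a simple AR(1) via stats::arima)
fit <- arima(x - seas, order = c(1, 0, 0))
# 3. Forecast and re-seasonalise: add the last seasonal cycle back
h <- 13
fc <- as.numeric(predict(fit, n.ahead = h)$pred) + as.numeric(tail(seas, h))
```

stlm(method = "arima") automates steps like these (with auto-selected ARIMA orders), and method = "ets" swaps in an exponential smoothing model for step 2.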
# Create Models
model_arima <- stlm(fnb_msts, method = "arima")
model_hw <- stlm(fnb_msts, method = "ets")
model_tbats <- fnb_msts %>%
tbats(use.box.cox = F,
use.trend = T,
use.damped.trend = T)

# Forecast the next 7 days (13 hours x 7 days)
forecast_arima <- forecast(model_arima, h = 13*7)
forecast_hw <- forecast(model_hw, h = 13*7)
forecast_tbats <- forecast(model_tbats, h = 13*7)

1. Multi-Seasonal ARIMA:
fnb_msts %>%
autoplot(series = "actual") +
autolayer(forecast_arima$fitted, series = "train") +
autolayer(forecast_arima$mean, series = "test") +
theme_minimal()
2. Multi-seasonal Holt-Winter
fnb_msts %>%
autoplot(series = "actual") +
autolayer(forecast_hw$fitted, series = "train") +
autolayer(forecast_hw$mean, series = "test") +
theme_minimal()
3. TBATS model
fnb_msts %>%
autoplot(series = "actual") +
autolayer(forecast_tbats$fitted, series = "train") +
autolayer(forecast_tbats$mean, series = "test") +
theme_minimal()

Evaluate model performance:
modelacc <- rbind(
accuracy(forecast_arima$mean, fnb_test_msts$visitor),
accuracy(forecast_hw$mean, fnb_test_msts$visitor),
accuracy(forecast_tbats$mean, fnb_test_msts$visitor))
rownames(modelacc) <- c("Multi-Seasonal ARIMA", "Multi-seasonal Holt-Winter", "TBATS model" )
modelacc
## ME RMSE MAE MPE MAPE
## Multi-Seasonal ARIMA -1.384508 5.905591 4.569183 NaN Inf
## Multi-seasonal Holt-Winter 2.104726 6.000170 4.792508 Inf Inf
## TBATS model -1.026611 6.679003 5.370299 NaN Inf
Based on the accuracy summary above, the Multi-Seasonal ARIMA model performs best: its MAE is 4.569183 (below 6). This model will be used for prediction; the best model is the ARIMA model.
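The metrics in the table are straightforward to compute by hand; a base-R sketch with made-up actual/forecast values:

```r
# MAE and RMSE from scratch (toy vectors, hypothetical numbers)
actual <- c(25, 30, 40)
fcst   <- c(24, 33, 38)
err  <- actual - fcst          # errors: 1, -3, 2
mae  <- mean(abs(err))         # (1 + 3 + 2) / 3 = 2
rmse <- sqrt(mean(err^2))      # sqrt((1 + 9 + 4) / 3)
```

RMSE penalises large errors more heavily than MAE, which is why the model rankings by the two metrics can differ (here ARIMA wins on both).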
Data preparation for the plot:
accuracyData <- data.frame(ds= fnb_clean$ds %>% tail(13*7),
Actual = as.vector(fnb_test_msts) ,
TBATSForecast = as.vector(forecast_tbats$mean),
ArimaForecast = as.vector(forecast_arima$mean),
HWForecast = as.vector(forecast_hw$mean))

accuracyData %>%
ggplot() +
geom_line(aes(x = ds, y = Actual.visitor, colour = "Actual"), linewidth = 1) +
geom_line(aes(x = ds, y = ArimaForecast, colour = "Multi-Seasonal ARIMA Model"), linewidth = 1) +
labs(title = "Hourly Visitor - Actual vs Multi-Seasonal ARIMA Model", x = "Date", y = "Visitor", colour = "")
# Visualization of actual vs estimated number of visitors (all models)
accuracyData %>%
ggplot() +
geom_line(aes(x = ds, y = Actual.visitor, colour = "Actual"), linewidth = 0.5) +
geom_line(aes(x = ds, y = HWForecast, colour = "Holt-Winters Model"), linewidth = 0.1) +
geom_line(aes(x = ds, y = ArimaForecast, colour = "Arima Model (Best Model)"), linewidth = 0.5) +
geom_line(aes(x = ds, y = TBATSForecast, colour = "TBATS Model"), linewidth = 0.5) +
labs(title = "Hourly Visitor - Actual vs All Models", x = "Date", y = "Visitor", colour = "")
Prediction performance: MAE < 6 on the validation dataset.
The best model is the ARIMA model, with an MAE of 4.569183, the smallest among the models created.
accuracy(forecast_arima$mean, fnb_test_msts$visitor)
## ME RMSE MAE MPE MAPE
## Test set -1.384508 5.905591 4.569183 NaN Inf
MAE < 6 on the test dataset.

Predict using the multi-seasonal model and save the prediction to the CSV file submission-destian.csv using the best model, the ARIMA model. The steps:
knitr::include_graphics("MAE.png")
fnb_test <- read_csv("data-test.csv")
## Rows: 91 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## lgl (1): visitor
## dttm (1): datetime
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fnb_test$visitor <- forecast_arima$mean
write.csv(fnb_test, "submission-destian.csv", row.names = F)

We will check the autocorrelation and normality of the model residuals with the same tests used previously.
# Residual Autocorrelation
Box.test(model_arima$residuals, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: model_arima$residuals
## X-squared = 0.00010908, df = 1, p-value = 0.9917
Conclusion: the p-value > 0.05, so we can assume the residuals show no autocorrelation.
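For intuition on how the Ljung-Box test behaves, here is a deterministic base-R example of the opposite outcome (a series the test should flag):

```r
# A perfectly alternating series has lag-1 autocorrelation close to -1,
# so the Ljung-Box test rejects the no-autocorrelation null decisively
alternating <- rep(c(1, -1), times = 100)
bt <- Box.test(alternating, type = "Ljung-Box")
# bt$p.value is essentially 0 -> autocorrelation is present
```

Residuals that still carry autocorrelation like this would mean the model has left predictable structure on the table; the large p-value above is the desired result.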
shapiro.test(model_arima$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_arima$residuals
## W = 0.99139, p-value = 9.359e-06
hist(model_arima$residuals, breaks = 30)
p-value > 0.05: residuals normally distributed
p-value < 0.05: residuals not normally distributed
Since the p-value < 0.05, we conclude that the residuals are not normally distributed. This can happen when the amount of data used to build the model is not large enough; it does not necessarily mean the model forecasts poorly. Adding more data may yield normally distributed residuals and better forecasting performance. Because the normality assumption is not met, the errors are not centred at 0, so forecasts over a longer horizon will tend to have larger errors. To mitigate this, the model should be rebuilt whenever new historical data becomes available.
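The Shapiro-Wilk behaviour can be illustrated deterministically in base R: exact normal quantiles yield a large p-value, while exact exponential (right-skewed) quantiles yield a tiny one:

```r
# Exact quantiles of a standard normal: as "normal" as a sample can look
normal_like <- qnorm(ppoints(100))
p_norm <- shapiro.test(normal_like)$p.value   # large -> do not reject normality
# Exact quantiles of an exponential: heavily right-skewed
skewed <- qexp(ppoints(100))
p_skew <- shapiro.test(skewed)$p.value        # tiny -> reject normality
```

Our residuals fall into the second situation, which is also visible as asymmetry in the histogram above.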
fnb_test %>%
mutate(hour = hour(datetime),
visitors = as.numeric(visitor),  # convert from ts so ggplot can pick a scale
wday = wday(datetime, label = T, abbr = T)) %>%
ggplot(aes(x = hour, y = visitors)) +
geom_col(aes(fill = wday)) +
labs(title = "Seasonality across hour and weekdays", x = "Hour", y = "Visitors") +
theme_minimal() +
theme(legend.position = "right")
The most visitors are predicted to come to the restaurant at 20:00 WIB, and Saturday is the busiest day.